Abstract


This paper presents a novel algorithm for learning in a class of stochastic
Markov decision processes (MDPs) with continuous state and action
spaces that trades speed for accuracy. A transform of the stochastic
MDP into a deterministic one is presented which captures the essence of
the original dynamics, in a sense made precise. In this transformed
MDP, the calculation of values is greatly simplified. The online algorithm
estimates the model of the transformed MDP and simultaneously
does policy search against it. Bounds on the error of this approximation
are proven, and experimental results in a bicycle riding domain are presented.
The algorithm learns near optimal policies in orders of magnitude
fewer interactions with the stochastic MDP, using less domain
knowledge. All code used in the experiments is available on the
project’s web site.


This work was funded by DARPA as part of the "Natural Tasking of
Robots Based on Human Interaction Cues" project under contract number
DABT63-00-C-10102.


1  Introduction 


There is currently much interest in the problem of learning in stochastic Markov decision
processes (MDPs) with continuous state and action spaces [2, 9, 10]. For such domains,
especially when the state or action spaces are of high dimension, the value and Q-functions
may be quite complicated and difficult to approximate. However, there may be relatively
simple policies which perform well. This has led to recent interest in policy search
algorithms, in which the reinforcement signal is used to modify the policy directly [5, 6,
10].

For many problems, a positive reward is only achieved at the end of a task if the agent
reaches a “goal” state. For complex problems, the probability that an initial, random policy
would reach such a state could be vanishingly small. A widely used methodology to
overcome this is shaping [1, 3, 4, 8]. Shaping is the introduction of small rewards to
reward partial progress toward the goal. A shaping function eases the problem of backing
up rewards, since actions are rewarded or punished sooner.

When a policy changes, estimating the resulting change in values can be difficult,
requiring the new policy to interact with the MDP for many episodes. In this paper we
introduce a method of transforming a stochastic MDP into a deterministic one. Under certain
conditions on the original MDP, and given a shaping reward of the proper form, the
deterministic MDP can be used to estimate the value of any policy with respect to the original
MDP. This leads to an online algorithm for policy search: simultaneously estimate the
parameters of a model for the transformed, deterministic MDP, and use this model to estimate
both the value of a policy and the gradient of that value with respect to the policy
parameters. Then, using these estimates, perform gradient descent search on the policy
parameters. Since the transformation captures what is important about the original MDP
for planning, we call our method the “essential dynamics” algorithm.

The  next  section  gives  an  overview  of  the  technique,  developing  the  intuition  behind 
it.  In  section  3  we  describe  the  mathematical  foundations  of  the  algorithm,  including 
bounds  on  the  difference  between  values  in  the  original  and  transformed  MDPs.  Section  4 
describes  an  application  of  this  technique  to  learning  to  ride  a  bicycle.  The  last  section 
discusses  these  results,  comparing  them  to  previous  work.  On  the  bicycle  riding  task, 
given  the  simulator,  the  only  domain  knowledge  needed  is  a  shaping  reward  that  decreases 
as  lean  angle  increases,  and  as  angle  to  goal  increases.  Compared  to  previous  work  on  this 
problem,  a  near  optimal  policy  is  found  in  dramatically  less  simulated  time,  and  with  less 
domain  knowledge. 

2  Overview  of  the  Essential  Dynamics  Algorithm 

In  the  essential  dynamics  algorithm  we  learn  a  model  of  how  state  evolves  with  time,  and 
then  use  this  model  to  compute  the  value  of  the  current  policy.  In  addition,  if  the  policy 
and  model  are  from  a  parameterized  family,  we  can  compute  the  gradient  of  the  value  with 
respect  to  the  parameters. 

In putting this plan into practice, one difficulty is that state transitions are stochastic,
so that expected rewards must be computed. One way to compute them is to generate
many trajectories and average over them, but this can be very time consuming. Instead we
might be tempted to estimate only the mean of the state at each future time, and use the
rewards associated with that. However, we can do better. If the reward is quadratic, the
expected reward is particularly simple. Given knowledge of the state at time $t$, we can
then talk about the distribution of possible states at some later time. For a given distribution
of states, let $\bar{s}$ denote the expected state. Then

$$E[r(s)] = \int \left( a(s-\bar{s})^2 + b(s-\bar{s}) + c \right) P(s)\,ds = a\,\mathrm{var}(s) + b(\bar{s}-\bar{s}) + c = a\,\mathrm{var}(s) + c \tag{1}$$

where $a$, $b$ and $c$ depend on $\bar{s}$.
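As an illustrative check of Eq. (1), and not part of the paper's own code, the following sketch compares the closed form $a\,\mathrm{var}(s) + c$ with a Monte Carlo average of a quadratic reward. The Gaussian state distribution, the coefficient values and the function names are assumptions chosen only for the example; Eq. (1) itself does not require a Gaussian.

```python
import random


def expected_quadratic_reward(a, c, variance):
    """Closed form from Eq. (1): E[r(s)] = a * var(s) + c."""
    return a * variance + c


def monte_carlo_reward(a, b, c, mean, std, n_samples=200_000, seed=0):
    """Average r(s) = a(s - mean)^2 + b(s - mean) + c over sampled states.

    The Gaussian sampling distribution is an assumption made for this example.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        d = rng.gauss(mean, std) - mean  # deviation from the expected state
        total += a * d * d + b * d + c
    return total / n_samples


# Hypothetical coefficients: the linear term b drops out of the expectation.
closed_form = expected_quadratic_reward(a=-2.0, c=1.0, variance=0.5 ** 2)
sampled = monte_carlo_reward(a=-2.0, b=3.0, c=1.0, mean=0.7, std=0.5)
print(closed_form, sampled)
```

The two numbers should agree up to sampling noise, even though the sampled reward includes a nonzero linear coefficient.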
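Suppose the policy depends on a vector of parameters $\theta$. When interacting with
the MDP, at every time $t$ after having taken action $a_{t-1}$ in state $s_{t-1}$ and arriving in
state $s_t$:

1. $\tilde{\mu}(s_{t-1}, a_{t-1}) \leftarrow s_t$

2. $\tilde{v}(s_{t-1}, a_{t-1}) \leftarrow (s_t - \tilde{\mu}(s_{t-1}, a_{t-1}))^2$

3. $\tilde{s}_t = s_t$

4. $\tilde{\sigma}_t^2 = 0$

5. $\tilde{V} = 0$

6. For every $\tau$ in $t+1 \ldots t+n$:

a. $\tilde{s}_\tau = \tilde{\mu}(\tilde{s}_{\tau-1}, \pi(\tilde{s}_{\tau-1}))$

b. $\tilde{\sigma}_\tau^2 = \tilde{v}(\tilde{s}_{\tau-1}, \pi(\tilde{s}_{\tau-1})) + \tilde{\sigma}_{\tau-1}^2 \left( \tilde{\mu}_\pi'(\tilde{s}_{\tau-1}) \right)^2$

c. $\tilde{r}_\tau = \tfrac{1}{2} r''(\tilde{s}_\tau)\,\tilde{\sigma}_\tau^2 + r(\tilde{s}_\tau)$

d. $\tilde{V} = \tilde{V} + \gamma^{\tau - t}\,\tilde{r}_\tau$

7. Update the policy in the direction that increases $\tilde{V}$: $\theta = \theta + \alpha \dfrac{\partial \tilde{V}}{\partial \theta}$

Figure 1: The essential dynamics algorithm for a one dimensional state space. The notation
$f(x) \leftarrow a$ means “adjust the parameters that determine $f$ to make $f(x)$ closer to $a$,” e.g.
by gradient descent. $\tilde{\mu}_\pi'$ is the derivative of $\tilde{\mu}(s, \pi(s))$ with respect to $s$.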
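The planning part of Figure 1 (steps 3 to 7) can be sketched for a scalar state as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the model functions `mu` and `v` are taken as given rather than learned (steps 1 and 2 are omitted), derivatives are taken by finite differences, the policy has a single scalar parameter, and the toy dynamics, reward and all names are hypothetical.

```python
def rollout_value(s_t, theta, mu, v, policy, reward, gamma=0.95, n=30, h=1e-4):
    """Steps 3-6 of Figure 1 for a scalar state: returns the estimated value."""
    s, var, value = s_t, 0.0, 0.0            # steps 3, 4 and 5

    def step(x):                             # closed-loop mean x -> mu(x, pi(x))
        return mu(x, policy(x, theta))

    for k in range(1, n + 1):                # k = tau - t
        dmu = (step(s + h) - step(s - h)) / (2 * h)    # mu'_pi at the previous mean
        var = v(s, policy(s, theta)) + var * dmu ** 2  # step 6b (uses the previous s)
        s = step(s)                                    # step 6a
        d2r = (reward(s + h) - 2 * reward(s) + reward(s - h)) / h ** 2
        value += gamma ** k * (0.5 * d2r * var + reward(s))  # steps 6c and 6d
    return value


def policy_update(s_t, theta, alpha, eps=1e-4, **model):
    """Step 7, with a central-difference estimate of dV/dtheta (scalar theta)."""
    up = rollout_value(s_t, theta + eps, **model)
    down = rollout_value(s_t, theta - eps, **model)
    return theta + alpha * (up - down) / (2 * eps)


# Hypothetical toy problem: linear dynamics, constant noise, quadratic reward.
toy = dict(
    mu=lambda s, a: 0.9 * s + a,
    v=lambda s, a: 0.01,
    policy=lambda s, theta: theta * s,
    reward=lambda s: -s * s,
)
value_before = rollout_value(1.0, 0.0, **toy)
theta_after = policy_update(1.0, 0.0, alpha=0.01, **toy)
value_after = rollout_value(1.0, theta_after, **toy)
print(value_before, theta_after, value_after)
```

In this toy problem a single update moves the policy parameter toward values that pull the state to zero, and the estimated value increases accordingly.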


Thus, to calculate the expected reward, we don’t need to know the full state distribution,
but simply its mean and variance. Our model should therefore describe how the mean
and variance evolve over time. If the state transitions are “smooth,” they can be approximated
by a Taylor series. Let $\pi$ be the current policy, and let $\mu_\pi(s)$ denote the expected
state that results from taking action $\pi(s)$ in state $s$. If $\bar{s}_t$ denotes the mean state at time $t$,
and $\sigma_t^2$ the variance, and if state transitions were deterministic, then to first order we
would have

$$\bar{s}_{t+1} \approx \mu_\pi(\bar{s}_t), \qquad \sigma_{t+1}^2 \approx \sigma_t^2 \left( \mu_\pi'(\bar{s}_t) \right)^2$$

where $\mu_\pi'$ is the derivative of $\mu_\pi$ with respect to state. For stochastic state transitions, let
$v_\pi(s)$ be the variance of the state that results from taking action $\pi(s)$ in state $s$. It turns out
that the variance at the next time step is simply $v_\pi(s)$ plus the transformed variance from
above, leading to

$$\bar{s}_{t+1} \approx \mu_\pi(\bar{s}_t), \qquad \sigma_{t+1}^2 \approx v_\pi(\bar{s}_t) + \sigma_t^2 \left( \mu_\pi'(\bar{s}_t) \right)^2 \tag{2}$$

Thus, we learn estimates $\tilde{\mu}$ and $\tilde{v}$ of $\mu$ and $v$ respectively, use Eq. (2) to estimate the
mean and variance of future states, and Eq. (1) to calculate the expected reward. The
resulting algorithm, which we call the essential dynamics algorithm, is presented in
Figure 1.
The next section gives a formal derivation of the algorithm, and proves error bounds
on the estimated state, variance, reward and value for the general n-dimensional case,
where the reward is only approximately quadratic.


3  Derivation  of  the  Essential  Dynamics  Algorithm 


A Markov Decision Process (MDP) is a tuple $\langle S, D, A, P_{s,a}, r, \gamma \rangle$ where: $S$ is a set of
states; $D : S \to \mathbb{R}$ is the initial-state distribution; $A$ is a set of actions; $P_{s,a} : S \to \mathbb{R}$ are the
transition probabilities; $r : S \times A \to \mathbb{R}$ is the reward; and $\gamma$ is the discount factor. This
paper is concerned with continuous state and action spaces; in particular we assume
$S = \mathbb{R}^{n_s}$ and $A = \mathbb{R}^{n_a}$. We use subscripts to denote time and superscripts to denote
components of vectors and matrices. Thus, $s_t^i$ denotes the $i$th component of the vector $s$ at
time $t$.

A (deterministic) policy is a mapping from a state to the action to be taken in that
state, $\pi : S \to A$. Given a policy and a distribution $P_t$ of states at time $t$, such as the initial
state distribution or the observed state, the distribution of states at future times is defined
by the recursive relation

$$P_{\tau+1}(s') = \int_S P_{s,\pi(s)}(s')\,P_\tau(s)\,ds \quad \text{for } \tau \ge t.$$

Given such a distribution, we can define the expectation and the covariance matrix of a
random vector $x$ with respect to it, which we denote $E_t[x]$ and $\mathrm{cov}_t(x)$ respectively. Thus,
$E_t[x] = \int x\,P_t(x)\,dx$ and $\mathrm{cov}_t^{ij}(x) = E_t[(x^i - E_t[x^i])(x^j - E_t[x^j])]$. When $P_t$ is zero
except for a single state $s_t$ we introduce $E[x \mid s_t]$ as a synonym for $E_t[x]$ which makes the
distribution explicit. Given an MDP, we define the limited horizon value function for a
given policy as

$$V_\pi(s_t) = \sum_{\tau=t}^{t+n} \gamma^{\tau-t}\,E_\tau[r(s_\tau, \pi(s_\tau))]$$

where the probability density at time $t$ is zero except for state $s_t$. Also given a policy, we
define two functions, the mean $\mu_\pi(s)$ and covariance matrix $v_\pi(s)$ of the next state. Thus,
$\mu_\pi(s_t) = E[s_{t+1} \mid s_t]$ and $v_\pi(s_t) = E[(s_{t+1} - \mu_\pi(s_t))(s_{t+1} - \mu_\pi(s_t))^T \mid s_t]$. In policy
search, we have a fixed set of policies $\Pi$ and we try to find one that results in a value
function with high values.

We transform the stochastic MDP $M$ to a deterministic one $M' = \langle S', s_0', A', f, r', \gamma' \rangle$
as follows. A state in the new MDP is an ordered pair consisting of a state from $S$ and a
covariance matrix, denoted $(s, \Sigma)$. The new initial state is $s_0' = (E_D[s], \mathrm{cov}_D[s])$. The
new action space is the set of all possible policies for $M$, that is $A' = \{\pi \mid \pi : S \to A\}$. The
state transition probabilities are replaced with a (deterministic) state transition function
$f(s_t', a_t')$, which gives the unique successor state that results from taking action $a_t' = \pi$
in state $s_t' = (s_t, \Sigma_t)$. We set

$$f(s_t', a_t') = f(s_t, \Sigma_t, \pi) = \left( \mu_\pi(s_t),\ v_\pi(s_t) + (\nabla\mu_\pi)\,\Sigma_t\,(\nabla\mu_\pi)^T \right).$$

The reward is

$$r'(s, \Sigma, \pi) = r(s) + \frac{1}{2}\,\mathrm{tr}\!\left( \left[ \frac{\partial^2 r}{\partial i\,\partial j}(s) \right] \Sigma \right)$$

where $\left[ \frac{\partial^2 r}{\partial i\,\partial j}(s) \right]$ denotes the matrix of second derivatives of $r$ with respect to each
state variable. Finally, $\gamma' = \gamma$.
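The transition function $f$ and reward $r'$ of the transformed MDP can be sketched in a few lines for small state vectors. This is an illustration only: the helper names are hypothetical, the mean, covariance, Jacobian and Hessian functions are assumed to be supplied by the caller, and the Jacobian is taken with entries `J[i][k]` equal to the derivative of the $i$th component of $\mu_\pi$ with respect to the $k$th state variable, which is an assumed index convention.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def transformed_step(s, Sigma, mu, v, jacobian):
    """f(s, Sigma, pi) = (mu_pi(s), v_pi(s) + J Sigma J^T), with J the Jacobian of mu_pi at s."""
    J = jacobian(s)
    propagated = matmul(matmul(J, Sigma), transpose(J))
    noise = v(s)
    n = len(s)
    new_Sigma = [[noise[i][j] + propagated[i][j] for j in range(n)] for i in range(n)]
    return mu(s), new_Sigma


def transformed_reward(s, Sigma, r, hessian):
    """r'(s, Sigma) = r(s) + 0.5 * tr(H(s) Sigma), with H the Hessian of r."""
    HS = matmul(hessian(s), Sigma)
    return r(s) + 0.5 * sum(HS[i][i] for i in range(len(s)))


# Hypothetical linear closed-loop dynamics s' = A s + noise, quadratic reward.
A = [[1.0, 0.1], [0.0, 1.0]]
next_mean, next_cov = transformed_step(
    s=[1.0, 0.5],
    Sigma=[[0.04, 0.0], [0.0, 0.01]],
    mu=lambda s: [sum(a * x for a, x in zip(row, s)) for row in A],
    v=lambda s: [[0.001, 0.0], [0.0, 0.002]],
    jacobian=lambda s: A,
)
shaped = transformed_reward(
    s=[1.0, 0.5],
    Sigma=[[0.04, 0.0], [0.0, 0.01]],
    r=lambda s: -(s[0] ** 2 + 10.0 * s[1] ** 2),
    hessian=lambda s: [[-2.0, 0.0], [0.0, -20.0]],
)
print(next_mean, next_cov, shaped)
```

For linear dynamics and a quadratic reward, as in this example, the propagated mean, covariance and expected reward are exact rather than approximate.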

The  strength  of  the  method  comes  from  the  theorems  below,  which  state  that  the 
above  transform  approximately  captures  the  dynamics  of  the  original  probabilistic  MDP 
to  the  extent  that  the  original  dynamics  are  “smooth.”  The  first  theorem  bounds  the  error 
in  approximating  state,  the  second  in  covariance,  the  third  in  reward  and  the  fourth  in 
value. 


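Theorem 1 Fix a time $t$, a policy $\pi$, and a distribution of states $P_t$. Choose $M_\mu$ and $M$
such that $\forall s$, $\left\| \left[ \frac{\partial^2 \mu_\pi^i}{\partial j\,\partial k}(s) \right] \right\|_F < M_\mu$, $\|\nabla\mu_\pi(\tilde{s}_t)\| < M$ and $\|\mathrm{cov}_t(s_t, s_t)\|_F < M$, where
$\|\cdot\|_F$ denotes the Frobenius norm. Let $\tilde{s}_t$ be given, and define $\tilde{s}_{t+1} = \mu_\pi(\tilde{s}_t)$,
$\varepsilon_t^s = E_t[s_t] - \tilde{s}_t$ and $\varepsilon_{t+1}^s = E_t[s_{t+1}] - \tilde{s}_{t+1}$. Then

$$\|\varepsilon_{t+1}^s\| \le \left( \|\varepsilon_t^s\| + M_\mu \right) \left( M + \tfrac{1}{2}\left( M + \|\varepsilon_t^s\|^2 \right) \right).$$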


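Theorem 2 Suppose $M_v$ and $M$ are chosen so that $\forall s$, $\|\nabla v_\pi(s)\| < M_v$,
$\| E_t[\,|s_t - E_t[s_t]|^k\,] \|_F < M$ for $k = 1, 2, 3, 4$, $\|\tilde{s}_{t+1}\| = \|\mu_\pi(\tilde{s}_t)\| < M$ and all the conditions
of Theorem 1. Let $\tilde{\Sigma}_t$ be given, and define $\tilde{\Sigma}_{t+1}^{ij} = v^{ij}(\tilde{s}_t) + (\nabla\mu^i(\tilde{s}_t))^T\,\tilde{\Sigma}_t\,(\nabla\mu^j(\tilde{s}_t))$. Let
$\varepsilon_t^\Sigma = \mathrm{cov}_t(s_t, s_t) - \tilde{\Sigma}_t$, similarly for $\varepsilon_{t+1}^\Sigma$. Then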

|®?+  i|FS  (|efl|,  +  Jefll  +W,  +  Mv)MH\0  +  0(||e,f) 


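Theorem 3 Suppose $\forall s$, $\left\| \left[ \frac{\partial^3 r}{\partial i\,\partial j\,\partial k}(s) \right] \right\|_F < M_r$, $\left\| \left[ \frac{\partial^2 r}{\partial i\,\partial j}(\tilde{s}_t) \right] \right\|_F < M$ and
$\|\nabla r(\tilde{s}_t)\| < M$ and the conditions of the previous two theorems. Let $\varepsilon_t^r = E_t[r(s_t)] - r'(\tilde{s}_t)$.
Then

$$E_t[r(s_t)] = r'(\tilde{s}_t) + \varepsilon_t^r = r(\tilde{s}_t) + \frac{1}{2}\,\mathrm{tr}\!\left( \left[ \frac{\partial^2 r}{\partial i\,\partial j}(\tilde{s}_t) \right] \tilde{\Sigma}_t \right) + \varepsilon_t^r$$

where

$$\|\varepsilon_t^r\| \le \left( \|\varepsilon_t^s\| + \|\varepsilon_t^\Sigma\|_F + M_r \right) \left( \tfrac{5}{3} M + O(\|\varepsilon_t^s\|) \right).$$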


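Theorem 4 Fix a time $t$ and a policy $\pi$, and a distribution of states $P_t$. Let $\tilde{s}_t$ and $\tilde{\Sigma}_t$ be
given, and define $\tilde{s}_\tau$ and $\tilde{\Sigma}_\tau$ for $\tau = t+1 \ldots t+n$ recursively as in theorems 1 and 2
above. Let $M_{\varepsilon^r}$ be an upper bound for $\|\varepsilon_\tau^r\|$ for all $\tau \in [t, t+n]$. Then under the conditions
of the above three theorems, $E[V(s_t)] = \tilde{V}(\tilde{s}_t) + \varepsilon_t^V$ where

$$\varepsilon_t^V \le \frac{1 - \gamma^{n+1}}{1 - \gamma}\,M_{\varepsilon^r}.$$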


Proof: First, some preliminaries. In the first three theorems, which deal only with a single
transition and a single distribution of states at time $t$, namely $P_t$, let $\bar{x} = E_{P_t}[x]$ for any
random variable $x$. Note that for any vector $x$ and square matrices $A$ and $B$,
$x^T A x = \mathrm{tr}(A(xx^T))$ where $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, $|\mathrm{tr}(AB)| \le \|A\|_F \|B\|_F$, and
$\|xx^T\|_F = \|x\|^2$. In the statement of theorem 2, $E_t[\,|(s_t - \bar{s}_t)^3|\,]$ is a three dimensional
matrix whose $i, j, k$ element is $E_t[\,|(s_t^i - \bar{s}_t^i)(s_t^j - \bar{s}_t^j)(s_t^k - \bar{s}_t^k)|\,]$. Similarly, $E_t[\,|(s_t - \bar{s}_t)^4|\,]$
is a four dimensional matrix, and if all of its elements are finite, then the lower powers
must also be finite. The Frobenius norm of such matrices is simply the square root of the
sum of the squares of all their elements. Also, if $a$, $b$, $c$ and $d$ are real numbers that are
greater than zero, then $ab + cd < (a + c)(b + d)$.
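The identities collected in these preliminaries are easy to spot-check numerically. The sketch below is only an illustration on randomly drawn 3-dimensional data; the helper names and the random test values are assumptions, and a numerical check of this kind does not replace the standard proofs.

```python
import math
import random

rng = random.Random(1)
n = 3
x = [rng.uniform(-1, 1) for _ in range(n)]
A = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
B = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]


def frobenius(M):
    """Square root of the sum of the squares of all elements."""
    return math.sqrt(sum(e * e for row in M for e in row))


def trace_of_product(P, Q):
    """tr(PQ) without forming the full product."""
    return sum(P[i][k] * Q[k][i] for i in range(n) for k in range(n))


outer = [[x[i] * x[j] for j in range(n)] for i in range(n)]       # x x^T
quadratic_form = sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

# x^T A x = tr(A (x x^T))
identity_gap = abs(quadratic_form - trace_of_product(A, outer))
# |tr(AB)| <= ||A||_F ||B||_F
trace_slack = frobenius(A) * frobenius(B) - abs(trace_of_product(A, B))
# ||x x^T||_F = ||x||^2
norm_gap = abs(frobenius(outer) - sum(e * e for e in x))
# ab + cd < (a + c)(b + d) for positive a, b, c, d
a, b, c, d = (rng.uniform(0.1, 2.0) for _ in range(4))
factoring_slack = (a + c) * (b + d) - (a * b + c * d)
print(identity_gap, trace_slack, norm_gap, factoring_slack)
```

The last inequality is the one used repeatedly in the proofs to “factor out” the small quantities from a sum of products.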

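Note that, since $\mu_\pi$ is a vector valued function, $\nabla\mu_\pi(s)$ is a matrix. Since $\mu_\pi^i$, the $i$th
component of $\mu_\pi$, is a real valued function, $\nabla\mu_\pi^i(s) \in \mathbb{R}^{n_s}$. Because $v(s)$ is a matrix,
$\nabla v^{i,j}(s) \in \mathbb{R}^{n_s}$. Let $\left[ \frac{\partial^2 \mu_\pi^i}{\partial j\,\partial k}(x) \right]$ denote the matrix of second partial derivatives of $\mu_\pi^i$,
evaluated at $x \in \mathbb{R}^{n_s}$. For any $s$, let $\Delta_1 = s - \bar{s}_t$, $\Delta_2 = \bar{s}_t - \tilde{s}_t$ and $\Delta = \Delta_1 + \Delta_2 = s - \tilde{s}_t$.
Thus, $E_t[\Delta] = \Delta_2$ and $E_t[\Delta\Delta^T] = E_t[\Delta_1\Delta_1^T] + \Delta_2\Delta_2^T = \Sigma_t + \Delta_2\Delta_2^T = \tilde{\Sigma}_t + \varepsilon_t^\Sigma + \Delta_2\Delta_2^T$.
Note that $\Delta_2 = \varepsilon_t^s$.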

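Proof of Theorem 1: Expand $\mu_\pi^i(s)$ using a first order Taylor series with the Lagrange
form of the remainder, namely

$$\mu_\pi^i(s) = \mu_\pi^i(\tilde{s}_t) + \nabla\mu_\pi^i(\tilde{s}_t)^T (s - \tilde{s}_t) + \frac{1}{2}(s - \tilde{s}_t)^T \left[ \frac{\partial^2 \mu_\pi^i}{\partial j\,\partial k}(x) \right] (s - \tilde{s}_t), \tag{3}$$

i.e.

$$\mu_\pi^i(s) = \mu_\pi^i(\tilde{s}_t) + \nabla\mu_\pi^i(\tilde{s}_t)^T \Delta + \frac{1}{2}\Delta^T \left[ \frac{\partial^2 \mu_\pi^i}{\partial j\,\partial k}(x) \right] \Delta \tag{4}$$

for some $x$ on the line joining $s$ and $\tilde{s}_t$. Then

$$E_t[s_{t+1}^i] - \tilde{s}_{t+1}^i = E_t[\mu_\pi^i(s_t)] - \mu_\pi^i(\tilde{s}_t)$$

$$= \mu_\pi^i(\tilde{s}_t) + \nabla\mu_\pi^i(\tilde{s}_t)^T \Delta_2 + \frac{1}{2}\,\mathrm{tr}\!\left( \left[ \frac{\partial^2 \mu_\pi^i}{\partial j\,\partial k}(x) \right] (\Sigma_t + \Delta_2\Delta_2^T) \right) - \mu_\pi^i(\tilde{s}_t)$$

So $\|\varepsilon_{t+1}^s\| \le M\|\varepsilon_t^s\| + \tfrac{1}{2} M_\mu \left( M + \|\varepsilon_t^s\|^2 \right) \le \left( \|\varepsilon_t^s\| + M_\mu \right)\left( M + \tfrac{1}{2}\left( M + \|\varepsilon_t^s\|^2 \right) \right)$.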


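Proof of Theorem 2: Let $M_k' = \| E_t[\,|(s_t - \tilde{s}_t)^k|\,] \|_F$. By the mean value theorem,
$v^{i,j}(s) = v^{i,j}(\tilde{s}_t) + \nabla v^{i,j}(x) \cdot \Delta$ for some $x$ on the line joining $s$ and $\tilde{s}_t$.
Also $v^{i,j}(s_t) = E[s_{t+1}^i s_{t+1}^j \mid s_t] - \mu^i(s_t)\mu^j(s_t)$ so that

$$\mathrm{cov}_t(s_{t+1}^i, s_{t+1}^j) = E_t[s_{t+1}^i s_{t+1}^j] - \bar{s}_{t+1}^i \bar{s}_{t+1}^j$$

$$= E_t\!\left[ E[s_{t+1}^i s_{t+1}^j \mid s_t] \right] - \bar{s}_{t+1}^i \bar{s}_{t+1}^j$$

$$= v^{i,j}(\tilde{s}_t) + E_t[\nabla v^{i,j}(x) \cdot \Delta] + E_t[\mu^i(s_t)\mu^j(s_t)] - \bar{s}_{t+1}^i \bar{s}_{t+1}^j \tag{5}$$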

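The second term is an error term, call it $\varepsilon'^{\,i,j}$. We have $\|\varepsilon'\| \le M_v M_1'$. For the third
term, we expand both $\mu^i$ and $\mu^j$ using Eq. (4) and multiplying out the terms, obtaining

$$E_t[\mu^i(s_t)\mu^j(s_t)] = \mu^i(\tilde{s}_t)\mu^j(\tilde{s}_t)$$

$$+\ \mu^i(\tilde{s}_t)\nabla\mu^j(\tilde{s}_t)^T\Delta_2 + \mu^j(\tilde{s}_t)\nabla\mu^i(\tilde{s}_t)^T\Delta_2$$

$$+\ \nabla\mu^i(\tilde{s}_t)^T (\tilde{\Sigma}_t + \varepsilon_t^\Sigma + \Delta_2\Delta_2^T) \nabla\mu^j(\tilde{s}_t)$$

$$+\ \frac{1}{2}\mu^i(\tilde{s}_t)\,E_t\!\left[ \Delta^T \left[ \frac{\partial^2 \mu^j}{\partial k\,\partial l}(x) \right] \Delta \right] + \frac{1}{2}\mu^j(\tilde{s}_t)\,E_t\!\left[ \Delta^T \left[ \frac{\partial^2 \mu^i}{\partial k\,\partial l}(x) \right] \Delta \right]$$

$$+\ \frac{1}{2}\nabla\mu^i(\tilde{s}_t)^T E_t\!\left[ \Delta\Delta^T \left[ \frac{\partial^2 \mu^j}{\partial k\,\partial l}(x) \right] \Delta \right] + \frac{1}{2}\nabla\mu^j(\tilde{s}_t)^T E_t\!\left[ \Delta\Delta^T \left[ \frac{\partial^2 \mu^i}{\partial k\,\partial l}(x) \right] \Delta \right]$$

$$+\ \frac{1}{4} E_t\!\left[ \Delta^T \left[ \frac{\partial^2 \mu^i}{\partial k\,\partial l}(x) \right] \Delta\,\Delta^T \left[ \frac{\partial^2 \mu^j}{\partial k\,\partial l}(x) \right] \Delta \right]$$

All terms other than the first and the one involving $\tilde{\Sigma}_t$ are error terms, call their sum
$\varepsilon''^{\,i,j}$. That is,

$$E_t[\mu^i(s_t)\mu^j(s_t)] = \mu^i(\tilde{s}_t)\mu^j(\tilde{s}_t) + \nabla\mu^i(\tilde{s}_t)^T \tilde{\Sigma}_t \nabla\mu^j(\tilde{s}_t) + \varepsilon''^{\,i,j}$$

where

$$\|\varepsilon''\| \le 2\|\tilde{s}_{t+1}\|\,\|\nabla\mu(\tilde{s}_t)\|\,\|\varepsilon_t^s\| + \|\nabla\mu(\tilde{s}_t)\|^2 \left( \|\varepsilon_t^s\|^2 + \|\varepsilon_t^\Sigma\|_F \right)$$

$$+\ \|\tilde{s}_{t+1}\| M_\mu M_2' + \|\nabla\mu(\tilde{s}_t)\| M_\mu M_3' + \tfrac{1}{4} M_\mu^2 M_4'$$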

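Lastly let $\varepsilon'''^{\,i,j} = \mu^i(\tilde{s}_t)\mu^j(\tilde{s}_t) - \bar{s}_{t+1}^i \bar{s}_{t+1}^j$. By Theorem 1,

$$\varepsilon'''^{\,i,j} = \mu^i(\tilde{s}_t)\mu^j(\tilde{s}_t) - \bar{s}_{t+1}^i \bar{s}_{t+1}^j$$

$$= \mu^i(\tilde{s}_t)\mu^j(\tilde{s}_t) - \left( \mu^i(\tilde{s}_t) + \varepsilon_{t+1}^{s,i} \right)\left( \mu^j(\tilde{s}_t) + \varepsilon_{t+1}^{s,j} \right)$$

$$= -\mu^i(\tilde{s}_t)\,\varepsilon_{t+1}^{s,j} - \mu^j(\tilde{s}_t)\,\varepsilon_{t+1}^{s,i} - \varepsilon_{t+1}^{s,i}\,\varepsilon_{t+1}^{s,j}$$

Substituting into Eq. (5), we obtain:

$$\mathrm{cov}_t(s_{t+1}^i, s_{t+1}^j) = v^{i,j}(\tilde{s}_t) + \varepsilon'^{\,i,j} + \nabla\mu^i(\tilde{s}_t)^T \tilde{\Sigma}_t \nabla\mu^j(\tilde{s}_t) + \varepsilon''^{\,i,j} + \varepsilon'''^{\,i,j}$$

so that $\varepsilon_{t+1}^\Sigma = \varepsilon' + \varepsilon'' + \varepsilon'''$ and

$$\|\varepsilon_{t+1}^\Sigma\|_F \le M_v M_1' + 2M^2\|\varepsilon_t^s\| + M^2\left( \|\varepsilon_t^s\|^2 + \|\varepsilon_t^\Sigma\|_F \right) + M M_\mu M_2' + M M_\mu M_3' + \tfrac{1}{4} M_\mu^2 M_4'$$

$$+\ 2M\left( \|\varepsilon_t^s\| + M_\mu \right)\left( M + \tfrac{1}{2}\left( M + \|\varepsilon_t^s\|^2 \right) \right) + \left( \left( \|\varepsilon_t^s\| + M_\mu \right)\left( M + \tfrac{1}{2}\left( M + \|\varepsilon_t^s\|^2 \right) \right) \right)^2$$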

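Each term has at least one of the “small bounds” $\|\varepsilon_t^s\|$, $\|\varepsilon_t^\Sigma\|_F$, $M_\mu$ or $M_v$. Using the
inequality from the preliminaries, we can “factor them out.” The four $M_k'$ are bounded by
$M + O(\|\varepsilon_t^s\|)$, as can be shown using the binomial theorem, e.g.

$$E_t[\,|\Delta_1 + \Delta_2|^3\,] \le E_t[(|\Delta_1| + |\Delta_2|)^3]$$

$$= E_t[|\Delta_1|^3] + 3|\Delta_2|\,E_t[|\Delta_1|^2] + 3|\Delta_2|^2\,E_t[|\Delta_1|] + |\Delta_2|^3$$

$$= E_t[|\Delta_1|^3] + O(\|\varepsilon_t^s\|) \qquad \blacksquare$$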

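Proof of Theorem 3: Expand $r(s)$ using a second order Taylor series with the Lagrange
form of the remainder, namely

$$r(s) = r(\tilde{s}_t) + \nabla r(\tilde{s}_t)^T \cdot \Delta + \frac{1}{2}\Delta^T \left[ \frac{\partial^2 r}{\partial i\,\partial j}(\tilde{s}_t) \right] \Delta + \frac{1}{6}\sum_{i,j,k} \frac{\partial^3 r}{\partial i\,\partial j\,\partial k}(x)\,\Delta^i \Delta^j \Delta^k. \tag{6}$$

Call the last term $\varepsilon'$. Thus,

$$E_t[r(s)] = r(\tilde{s}_t) + \nabla r(\tilde{s}_t)^T \cdot \Delta_2 + \frac{1}{2}\,\mathrm{tr}\!\left( \left[ \frac{\partial^2 r}{\partial i\,\partial j}(\tilde{s}_t) \right] (\tilde{\Sigma}_t + \varepsilon_t^\Sigma + \Delta_2\Delta_2^T) \right) + E_t[\varepsilon']$$

$$= r'(\tilde{s}_t) + \varepsilon_t^r$$

$$\|\varepsilon_t^r\| \le \|\nabla r(\tilde{s}_t)\|\,\|\varepsilon_t^s\| + \tfrac{1}{2} M \|\varepsilon_t^\Sigma\|_F + \tfrac{1}{2} M \|\varepsilon_t^s\|^2 + \tfrac{1}{6} M_r M_3'$$

$$\le \left( \|\varepsilon_t^s\| + \|\varepsilon_t^\Sigma\|_F + M_r \right)\left( M + \tfrac{1}{2} M + \tfrac{1}{2} M \|\varepsilon_t^s\| + \tfrac{1}{6} M_3' \right)$$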


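Proof of Theorem 4:

$$E[V(s_t)] = \sum_{\tau=t}^{t+n} \gamma^{\tau-t}\,E_\tau[r(s_\tau)]$$

$$= \sum_{\tau=t}^{t+n} \gamma^{\tau-t} \left( r'(\tilde{s}_\tau) + \varepsilon_\tau^r \right)$$

$$= \tilde{V}(\tilde{s}_t) + \sum_{\tau=t}^{t+n} \gamma^{\tau-t}\,\varepsilon_\tau^r$$

So,

$$\varepsilon_t^V = \sum_{\tau=t}^{t+n} \gamma^{\tau-t}\,\varepsilon_\tau^r \le M_{\varepsilon^r} \sum_{k=0}^{n} \gamma^k = \frac{1 - \gamma^{n+1}}{1 - \gamma}\,M_{\varepsilon^r}.$$
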
The above theorems state that as long as $\|\varepsilon_t^s\|$, $\|\varepsilon_t^\Sigma\|_F$, $M_\mu$, $M_v$ and $M_r$ are small and $M$
is finite, and given a good estimate of the mean and covariance of the state at some time,
the transformed MDP will result in good estimates at later times, and hence the reward and
value functions will also be good estimates. Note that no particular distribution of states is
assumed, only that, essentially, the first four moments are bounded at every time. The
most unusual conditions are that the reward $r$ be roughly quadratic, and that the value
function include only a limited number of future rewards. This motivates the use of shaping
rewards.
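The geometric factor in the Theorem 4 bound can be illustrated numerically. The sketch below assumes the worst case in which every per-step reward error equals its upper bound, and compares the discounted sum with the closed form $(1 - \gamma^{n+1})/(1 - \gamma)$; the function names and the numbers used are hypothetical.

```python
def value_error_bound(gamma, n, reward_error_bound):
    """Closed-form bound from Theorem 4 on the error in the limited horizon value."""
    return (1 - gamma ** (n + 1)) / (1 - gamma) * reward_error_bound


def worst_case_error(gamma, n, reward_error_bound):
    """Discounted sum of n + 1 reward errors, each at its upper bound."""
    return sum(gamma ** k * reward_error_bound for k in range(n + 1))


# Hypothetical numbers: 30 steps of look-ahead, per-step reward error at most 0.01.
bound = value_error_bound(gamma=0.9, n=30, reward_error_bound=0.01)
summed = worst_case_error(gamma=0.9, n=30, reward_error_bound=0.01)
limit = 0.01 / (1 - 0.9)  # what the bound tends to as n grows
print(bound, summed, limit)
```

The bound stays below `reward_error_bound / (1 - gamma)` for any horizon, which is one way to see why summing only a limited number of discounted rewards keeps the value error under control.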

4  Experiments 

The code used for all experiments in this paper is available from
www.metahuman.org/martin/Research.html.

The essential dynamics algorithm was applied to Randløv and Alstrøm’s bicycle
riding task [8], with the objective of riding a bicycle to a goal 1 km away. The five state
variables were simply the lean angle, the handlebar angle, their time derivatives, and the
angle to the goal. The two actions were the torque to apply to the handlebars and the horizontal
displacement of the rider’s center of mass from the bicycle’s center line. The stochasticity
of state transitions came from a uniform random number added to the rider’s
displacement. If the lean angle exceeded $\pi/15$, the bicycle fell over and the run terminated.

If the variance of the state is not too large at every time step, then the variance term in
the transformed reward can simply be considered another form of error, and only $\mu$ need
be estimated. This was done here. A continuous time formulation was used where,
instead of estimating the values of the state variables at a next time, their derivatives were
estimated. The model was of the form

$$\frac{\partial s^i}{\partial t} = \tilde{\mu}^i(s, a) = w^i \cdot \varphi(s, a)$$

where $\varphi(s, a)$ was a vector of features and $w^i$ was a vector of weights. The features were
simply the state and action variables themselves. The derivative of each state variable was
estimated using gradient descent on $w^i$, with the error measure $\mathrm{err}_i = |\dot{s}^i - w^i \cdot \varphi(s, a)|$
and a learning rate of 1.0. This error measure was found to work better than the more traditional
squared error. The squared error is minimized by the mean of the observed values,
whereas the absolute value is minimized by the median [7]. The median is a more
robust estimate of central tendency, i.e. less susceptible to outliers, and therefore may be a
better choice in many practical situations.
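To illustrate why the absolute error is more robust, here is a minimal sketch of gradient descent on $|\text{target} - w \cdot \varphi|$ for a linear model. It is not the paper's code: the data, the single constant feature and the decaying step size are assumptions made so that the example settles visibly (the experiments above used a learning rate of 1.0).

```python
def abs_error_update(w, features, target, lr):
    """One gradient step on |target - w . features|.

    The gradient with respect to w is -sign(error) * features, so the step size
    does not grow with the size of the error, which limits the pull of outliers.
    """
    prediction = sum(wi * fi for wi, fi in zip(w, features))
    error = target - prediction
    sign = (error > 0) - (error < 0)
    return [wi + lr * sign * fi for wi, fi in zip(w, features)]


# Hypothetical observations of one state derivative, with a single outlier.
observations = [1.0, 1.1, 0.9, 1.05, 50.0]
mean_value = sum(observations) / len(observations)        # dragged up by the outlier
median_value = sorted(observations)[len(observations) // 2]

w = [0.0]                       # one weight on a constant feature
step = 0
for _ in range(400):            # repeated passes over the data
    for target in observations:
        step += 1
        w = abs_error_update(w, [1.0], target, lr=1.0 / step)
print(w[0], median_value, mean_value)
```

With a constant feature the learned weight settles near the median of the observations rather than the mean, so the single large outlier has little influence on the fit.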

Model estimation was done online, simultaneous with policy search. In the continuous
formulation, the value function is the time integral of the reward times the discount
factor. The future state was estimated using Euler integration [7]. While the bicycle simulator
also used Euler integration, these choices were unrelated. In fact, $\Delta t = 0.01$ s for the
bicycle simulator and 0.051 s for integrating the estimated reward. It was integrated for 30
time steps.
Figure 2: The left graph shows length of episode vs. training time for 10 runs. The
dashed line indicates the optimal policy. Stable riding was achieved within 200 simulated
seconds. The right graph shows angle to goal vs. time for a single episode starting after
3000 simulated seconds of training.


The shaping reward was the square of the angle to goal plus 10 times the square of the
lean angle. The policy was a weighted sum of features, with a small Gaussian added for
exploration, $\pi(s) = \theta \cdot \varphi(s) + N(0, 0.05)$. The features were simply the state variables
themselves. When the model is poor or the policy parameters are far from a local optimum,
$\partial V/\partial\theta$ can be quite large, resulting in a large gradient descent step which may
overshoot its region of applicability. This can be addressed by reducing the learning rate,
but then learning becomes interminably slow. Thus, the gradient descent rule was modified to

$$\Delta\theta = \alpha\,\frac{\partial V/\partial\theta}{\beta + \|\partial V/\partial\theta\|}.$$

Near an optimum, when $\|\partial V/\partial\theta\| \ll \beta$, this reduces to
the usual rule with a learning rate of $\alpha/\beta$. In this experiment, $\alpha = 0.01$ and $\beta = 1.0$.
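The modified update rule is short enough to sketch directly. The function name and the example gradients below are hypothetical; the defaults follow the values of $\alpha$ and $\beta$ reported for this experiment.

```python
import math


def modified_gradient_step(grad, alpha=0.01, beta=1.0):
    """Parameter change alpha * g / (beta + ||g||) for a gradient vector g."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = alpha / (beta + norm)
    return [scale * g for g in grad]


# A very large gradient gives a step whose length stays just below alpha.
big_step = modified_gradient_step([3000.0, 4000.0])
big_length = math.sqrt(sum(d * d for d in big_step))

# A small gradient gives nearly the usual step with learning rate alpha / beta.
small_grad = [3e-4, 4e-4]
small_step = modified_gradient_step(small_grad)
usual_step = [0.01 / 1.0 * g for g in small_grad]
print(big_length, small_step, usual_step)
```

The step length can never exceed $\alpha$, however large the gradient, while close to an optimum the rule behaves like ordinary gradient descent.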

A graph of episode time vs. learning time is shown in Figure 2. After falling over
between 40 and 60 times, the controller was able to ride to the goal or the time limit without
falling over. After a single such episode, it consistently rode directly to the goal in a
near minimum amount of time. The resulting policy was essentially an optimal policy.


5  Discussion 

For learning and planning in complex worlds with continuous, high dimensional state and
action spaces, the goal is not so much to converge on a perfect solution, but to find a good
solution within a reasonable time. Such problems often use a shaping reward to accelerate
learning. For a large class of such problems, this paper proposes approximating the problem’s
dynamics in such a way that the mean and covariance of the future state can be estimated
from the observed current state. We have shown that, under certain conditions, the
rewards in the approximate MDP are close to those in the original, with an error that
grows boundedly as time increases. Thus, if the rewards are only summed for a limited
number of steps ahead, the resulting values will approximate the values of the original system.
Learning in this transformed problem is considerably easier than in the original, and
both model estimation and policy search can be achieved online.

The simulation of bicycle riding is a good example of a problem where the value
function is complex and hard to approximate, yet simple policies produce near optimal
solutions. Using a traditional value function approximation approach, Randløv needed to
augment the state with the second derivative of the lean angle ($\ddot{\omega}$) and provide shaping
rewards [8]. The resulting algorithm took 1700 episodes to ride stably, and 4200 episodes
to get to the goal for the first time. The resulting policies tended to ride in circles and precess
toward the goal, riding roughly 7 km to get to a goal 1 km away.

In contrast, when the action is a weighted sum of (very simple) features, random
search can find near optimal policies. This was tested experimentally; 0.55% of random
policies consistently reached the goal when $\ddot{\omega}$ was included in the state, and 0.30% did
when it wasn’t.¹ What’s more, over half of these policies had a path length within 1% of
the best reported solutions. Policies that rode stably but not to the goal were obtained
0.89% and 0.24% of the time respectively. Thus, a random search of policies needs only
a few hundred episodes to find a near optimal policy.

The essential dynamics algorithm consistently finds such near optimal policies, and
the author is aware of only one other algorithm which does, the PEGASUS algorithm of
[5]. The experiments in this paper took 40 to 60 episodes to ride stably, that is, to the goal
or until the time limit without falling over. After a single such episode, the policy consistently
rode directly to the goal in a near minimum amount of time. In contrast, PEGASUS
used at least 450 episodes to evaluate each policy.² One reasonable initial policy is to
always apply zero torque to the handlebars and zero displacement of body position. This
falls over in an average of 1.74 seconds, so PEGASUS would need 780 simulated seconds
to evaluate such a policy. The essential dynamics algorithm learns to ride stably in
approximately 200 simulated seconds, and in the second 780 simulated seconds will have
found a near optimal policy.
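The 780-second figure quoted above appears to follow from the 450 episodes per policy evaluation and the 1.74-second average time before falling. Spelled out, under the assumption that every one of those episodes ends in a fall after the average time:

```latex
\[
450 \ \text{episodes} \times 1.74 \ \text{s per episode} = 783 \ \text{s} \approx 780 \ \text{s}
\]
```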

This was achieved using very little domain knowledge. $\ddot{\omega}$ was not needed in the
state, and the features were trivial. The essential dynamics algorithm can be used for
online learning, or can learn from trajectories provided by other policies, that is, it can
“learn by watching.” In the bicycle experiment, the essential dynamics algorithm needed
many times more computing power per simulated second than PEGASUS, although it was
still faster than real time on a 1 GHz mobile Pentium III, and therefore could presumably
be used for learning on a real bicycle. The experiments in section 4 added the square of
the lean angle to the shaping reward, but did not use any information about dynamics (i.e.
velocities or accelerations), nor about the handlebars. In fact, the shaping reward simply
corresponded to the common sense advice “stay upright and head toward the goal.”

However,  these  advantages  do  not  come  without  drawbacks.  The  essential  dynamics 
algorithm  only  does  policy  search  in  an  approximation  to  the  original  MDP,  so  an  optimal 
policy  for  this  approximate  MDP  won’t,  in  general,  be  optimal  for  the  original  MDP.  The 
theorems  in  section  3  give  bounds  on  this  error,  and  for  bicycle  riding  this  error  is  small. 

Conclusion 

This paper has presented an algorithm for online policy search in MDPs with continuous
state and action spaces. A stochastic MDP is transformed to a deterministic MDP which
captures the essential dynamics of the original. Policy search is then performed in this
transformed MDP. Error bounds were given and the technique was applied to a simulation
of bicycle riding. The algorithm found near optimal solutions with less domain knowledge
and orders of magnitude less time than existing techniques.
1. Our experiment contained two conditions, namely with or without $\ddot{\omega}$ in the state, resulting
in 5 or 6 state variables. The features were the state variables themselves, state and action variables
were scaled to roughly the range [-1, +1], weights were chosen uniformly from [-2, +2], and
each policy was run 30 times. In 100,000 policies per condition, 549 (0.55%) reached the goal all
30 times when $\ddot{\omega}$ was included, and 300 (0.30%) when it wasn't. For such policies, the median
riding distance was 1009 m and 1008 m respectively. The code used is available on the web site.

2. [5] evaluated a given policy by simulating it 30 times. The derivative with respect to each
of the 15 weights was evaluated using finite differences, requiring another 30 simulations per
weight, for a total of 30 × 15 = 450 simulations. Often, the starting weights at a given stage were
evaluated during the previous stage, so only the derivatives need to be calculated.